Abstract

The elastic-input neuro tagger and the hybrid tagger, which combines a
neural network with Brill’s error-driven learning, have already been
proposed for the purpose of constructing a practical tagger using as
little training data as possible. When a small Thai corpus is used for
training, these taggers have tagging accuracies of 94.4% and 95.5%,
respectively (counting only the words that are ambiguous in terms of
part of speech). In this study, in order to construct more accurate
taggers, we developed new tagging methods using three machine learning
methods: the decision-list, maximum entropy, and support vector machine
methods. We then performed tagging experiments using these methods.

Our results showed that the support vector machine method has the best
precision (96.1%), and that it is capable of improving the accuracy of
tagging in the Thai language. Finally, we theoretically examined all
these methods and discussed how the improvements were achieved.

1 Introduction 

The elastic-input neuro tagger and the hybrid tagger, which combines a
neural network with Brill’s error-driven learning, have already been
proposed for the purpose of constructing a practical tagger using as
little training data as possible. When a small Thai corpus is used for
training, these taggers have tagging accuracies of 94.4% and 95.5%,
respectively (counting only the words that are ambiguous in terms of
part of speech). In this study, in order to construct more accurate
taggers, we developed new tagging methods using three machine learning
methods: the decision-list, maximum entropy, and support vector machine
methods. We then performed tagging experiments using these methods. As
supervised data for POS tagging in the Thai language, we used the same
corpus as in our group’s previous papers (Ma et al., 1998; Ma et al.,
1999; Ma et al., 2000).

In connection with our approach, we should 
emphasize the following points: 


• In this work, we performed POS tagging in the Thai language by using
the support vector machine method. Although many studies have applied
machine learning methods to POS tagging, few have used the support
vector machine method. This method achieves high performance, but it
requires very large computational resources and becomes impractical
when large-scale corpora are used as supervised data. In addition, with
large-scale corpora, good performance can be obtained with a simple
method such as an HMM (hidden Markov model). For the Thai language,
however, large-scale corpora have not yet been constructed, so our
approach is effective.


• We also carried out experiments by using the decision list and
maximum entropy methods for comparison, and we confirmed that the
support vector machine method produced the best precision. This paper
presents data comparing the performance of these methods.

• The precision produced by the support vector machine method was
slightly higher than that obtained in a previous study (Ma et al.,
2000), which used the hybrid tagger combining a neural network with
Brill’s error-driven learning. This result advances the technology of
POS tagging in the Thai language.

2 Problems with POS tagging 

This study did not consider the segmentation of a sentence into words.
We assumed that the words had been segmented before POS tagging
began.¹ In this case, a sentence is expressed as follows:

$$S = (w_1, w_2, \ldots, w_n), \tag{1}$$

where $w_i$ is the $i$-th word in the sentence. POS tagging is the
assignment of a POS tag to each word. Therefore, the result of POS
tagging is expressed as follows:

$$T = (t_1, t_2, \ldots, t_n), \tag{2}$$

where $t_i$ is the tag for the POS of word $w_i$. Our goal is to
determine the correct POS tag for each word. The categories indicated
by the POS tags are defined in advance. POS-tagging problems can thus
be regarded as classification problems and can be handled by machine
learning methods.

1 Like Japanese, Thai is written without explicit word boundaries, so
morphological analysis of Thai involves word segmentation in addition
to POS tagging. This study did not consider word segmentation. To
handle word segmentation, we would have to generate all possible
segmentations by using a word dictionary and then perform a Viterbi
search so that the probability of the POS tagging and word segmentation
over the whole sentence is as high as possible. This study focused on
POS tagging, which would be one component of the Viterbi search.
Because our approach uses machine learning methods, probabilities are
output together with the estimated tags. Thus the methods in this study
can easily be used as one component in the Viterbi search.


3 Machine learning methods 

In this paper, we used the following three machine learning methods:²

• decision-list method

• maximum-entropy method

• support-vector machine method

In this section, these machine-learning methods are explained.

3.1 Decision-list Method 

In this method, pairs consisting of a feature $f_j$ and a category $a$
are stored in a list, called a decision list. The pairs are arranged in
the list according to a predefined order. To classify a given problem,
the decision list method searches from the top of the list and outputs
the category of the first pair whose feature occurs in the problem. In
this study, we use the value of $\tilde{p}(a|f_j)$ to arrange the pairs
in descending order.

This decision list method is equivalent to the following method using
probabilistic equations. The probability of each category is calculated
by using one feature $f_j$ ($\in F$, $1 \le j \le k$), and the category
with the highest probability is judged to be the correct category. The
probability of producing a category $a$ in a context $b$ is given by
the following equation:

$$p(a|b) = \tilde{p}(a|f_{max}), \tag{3}$$

where $f_{max}$ is defined as

$$f_{max} = \mathrm{argmax}_{f_j \in F} \max_{a_i \in A} \tilde{p}(a_i|f_j), \tag{4}$$

such that $\tilde{p}(a_i|f_j)$ is the occurrence rate of category $a_i$
when the context includes feature $f_j$.
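As an illustration of the procedure above, the following Python sketch builds a decision list from (features, category) examples and classifies a new context with the first matching pair. The function names, the string encoding of features such as `prev=the`, and the alphabetical tie-breaking rule are assumptions made for this illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict


def train_decision_list(examples):
    """Build a decision list from (features, category) training examples.

    Each rule is (rate, feature, category), where rate is the occurrence
    rate of the category among training contexts containing the feature.
    """
    counts = defaultdict(Counter)
    for features, category in examples:
        for feature in set(features):
            counts[feature][category] += 1

    rules = []
    for feature, by_category in counts.items():
        total = sum(by_category.values())
        for category, n in by_category.items():
            rules.append((n / total, feature, category))

    # Highest rate first; ties are broken alphabetically (an assumption)
    # so that the resulting order is deterministic.
    rules.sort(key=lambda r: (-r[0], r[1], r[2]))
    return rules


def classify(rules, features, default=None):
    """Return (category, rate) of the first rule whose feature is present."""
    present = set(features)
    for rate, feature, category in rules:
        if feature in present:
            return category, rate
    return default, 0.0
```

Because only the first matching pair decides the output, a single feature determines each answer, which is the behaviour expressed by Equations (3) and (4).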

2 Although there are also decision-tree learning methods such as C4.5,
we did not use them for the following two reasons. First, decision-tree
learning methods perform worse than the other methods on several tasks
(Murata et al., 2000; Taira and Haruno). Second, the number of
attributes used in this research was very large, and the performance of
C4.5 would become worse if the number of attributes was decreased so
that C4.5 could work.










3.2 Maximum-entropy Method 


In this method, the probability distribution $p(a, b)$ that satisfies
Equation (5) and maximizes Equation (6) is calculated. The category
with the maximum probability under this distribution is judged to be
the correct category (Ristad, 1997; Ristad, 1998):

$$\sum_{a \in A, b \in B} p(a, b)\, g_j(a, b) = \sum_{a \in A, b \in B} \tilde{p}(a, b)\, g_j(a, b) \quad \mathrm{for}\ \forall f_j\ (1 \le j \le k), \tag{5}$$

$$H(p) = -\sum_{a \in A, b \in B} p(a, b) \log(p(a, b)), \tag{6}$$

where $A$, $B$, and $F$ are a set of categories, a set of contexts, and
a set of features $f_j$ ($\in F$, $1 \le j \le k$), respectively;
$g_j(a, b)$ is a function with a value of 1 when context $b$ includes
feature $f_j$ and the category is $a$, and a value of 0 otherwise; and
$\tilde{p}(a, b)$ is the occurrence rate of pair $(a, b)$ in the
training data.

In general, the distribution of $\tilde{p}(a, b)$ is very sparse. We
cannot use it directly, so we must estimate the true distribution
$p(a, b)$ from the distribution of $\tilde{p}(a, b)$. In the
maximum-entropy method, we assume that the estimated frequency of each
pair of category and feature calculated from $\tilde{p}(a, b)$ is the
same as that calculated from $p(a, b)$ (this corresponds to Equation
(5)). These estimated values are not so sparse. We can thus use the
above assumption to calculate $p(a, b)$. Furthermore, we maximize the
entropy of the distribution $p(a, b)$ to obtain one solution for
$p(a, b)$, because Equation (5) alone admits many solutions for
$p(a, b)$. Maximizing the entropy makes the distribution more uniform,
which is known to be a robust remedy for data sparseness problems.
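To make the quantities in Equations (5) and (6) concrete, the following Python sketch computes the empirical distribution, the expectation of one indicator function, and the entropy of a distribution. It does not perform the estimation of $p(a, b)$ itself, which requires an iterative procedure outside the scope of this sketch. The function names and the reading of $g_j$ as one indicator per (feature, category) pair are assumptions made for illustration.

```python
import math
from collections import Counter


def empirical_distribution(examples):
    """Occurrence rate of each (category, context) pair in the training data."""
    counts = Counter((category, frozenset(context)) for category, context in examples)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def indicator(feature, category):
    """g(a, b) = 1 if context b includes the feature and a is the category, else 0."""
    def g(a, b):
        return 1.0 if a == category and feature in b else 0.0
    return g


def expectation(dist, g):
    """Sum over (a, b) of dist(a, b) * g(a, b): one side of Equation (5)."""
    return sum(p * g(a, b) for (a, b), p in dist.items())


def entropy(dist):
    """H(p) of Equation (6), using the natural logarithm."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)
```

A candidate model satisfies Equation (5) when `expectation` returns the same value for the model and for the empirical distribution under every indicator; among such models, the method selects the one for which `entropy` is largest.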

3.3 Support-vector Machine Method 

Figure 1: Maximizing the margin

In this method, data consisting of two categories is classified by
dividing space with a hyperplane. When the two categories are positive
and negative, a larger margin between the positive and negative
examples in the training data (see Figure 1³) is thought to lower the
probability of incorrectly choosing categories in open data. The
hyperplane maximizing the margin is determined, and classification is
done by using this hyperplane. Although the basics of the method are as
described above, extended versions of the method generally allow the
inner region of the margin in the training data to include a small
number of examples, and they replace the linear hyperplane with a
non-linear one by using kernel functions. Classification in the
extended methods is equivalent to classification using the following
discernment function, and the two categories can be classified on the
basis of whether the output value of the function is positive or
negative (Cristianini and Shawe-Taylor, 2000; Kudoh, 2000):


$$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right), \tag{7}$$

$$b = -\frac{\max_{i, y_i = -1} b_i + \min_{i, y_i = 1} b_i}{2},$$

$$b_i = \sum_{j=1}^{l} \alpha_j y_j K(x_j, x_i),$$

where $x$ is the context (a set of features) of an input example;
$x_i$ and $y_i$ ($i = 1, \ldots, l$, $y_i \in \{1, -1\}$) indicate the
context of the training data and its category, respectively; and the
function sgn is defined as

$$\mathrm{sgn}(x) = \begin{cases} 1 & (x > 0), \\ -1 & (\mathrm{otherwise}). \end{cases} \tag{8}$$

Each $\alpha_i$ ($i = 1, 2, \ldots$) is fixed when the value of
$L(\alpha)$ in Equation (9) is maximum under the conditions of
Equations (10) and (11).

$$L(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{9}$$

$$0 \le \alpha_i \le C \quad (i = 1, \ldots, l) \tag{10}$$

$$\sum_{i=1}^{l} \alpha_i y_i = 0 \tag{11}$$

3 In the figure, the white circles and black circles indicate positive
and negative examples, respectively. The solid line indicates the
hyperplane dividing space, and the broken lines indicate planes at the
boundaries of the margin regions.
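The discernment function of Equation (7) can be sketched in Python as follows, using the polynomial kernel $(x \cdot y + 1)^d$ that this paper adopts. This is only an illustration: contexts are represented as sets of binary features (so that the dot product is the size of the intersection), the function names are hypothetical, and the $\alpha_i$ values in any usage are assumed to be given rather than obtained by maximizing Equation (9).

```python
def poly_kernel(x, y, d=1):
    """(x . y + 1)^d; for sets of binary features, x . y = |x & y|."""
    return (len(x & y) + 1) ** d


def bias(train, d=1):
    """b = -(max over negatives of b_i + min over positives of b_i) / 2.

    train is a list of (x_i, y_i, alpha_i) with y_i in {1, -1}.
    """
    def b_i(x_i):
        return sum(alpha * y * poly_kernel(x_j, x_i, d) for x_j, y, alpha in train)

    highest_negative = max(b_i(x) for x, y, _ in train if y == -1)
    lowest_positive = min(b_i(x) for x, y, _ in train if y == 1)
    return -(highest_negative + lowest_positive) / 2


def sgn(value):
    """Equation (8): 1 if value > 0, otherwise -1."""
    return 1 if value > 0 else -1


def svm_classify(x, train, d=1):
    """Equation (7); only examples with alpha > 0 contribute to the sum."""
    total = sum(alpha * y * poly_kernel(x_i, x, d)
                for x_i, y, alpha in train if alpha > 0)
    return sgn(total + bias(train, d))
```

For example, with `train = [(frozenset({"a"}), 1, 1.0), (frozenset({"b"}), -1, 1.0)]`, which satisfies Equation (11), a context containing only `"a"` is classified as positive and a context containing only `"b"` as negative.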


Although the function $K$ is called a kernel function and various types
of kernel functions can be used, this paper uses a polynomial function
as follows:

$$K(x, y) = (x \cdot y + 1)^d, \tag{12}$$

where $C$ (in Equation (10)) and $d$ are constants set by
experimentation. In this paper, $C$ is fixed as 1 for all experiments.
Two values of $d$, $d = 1$ and $d = 2$, are used. A set of $x_i$ that
satisfies $\alpha_i > 0$ is called a support vector, and the portion
used to perform the sum in Equation (7) is calculated by only using
examples that are support vectors.

Support-vector machine methods can handle data consisting of two
categories. In general, data consisting of more than two categories can
be handled by using the pair-wise method (Kudoh and Matsumoto, 2000).
In this method, for data consisting of $N$ categories, all pairs of two
different categories ($N(N-1)/2$ pairs) are constructed. The better
category of each pair is determined by using a 2-category classifier
(in this paper, a support-vector machine⁴ is used as the 2-category
classifier), and finally the correct category is determined on the
basis of “voting” on the $N(N-1)/2$ pairs analyzed with the 2-category
classifier.

The support-vector machine method used in this paper is in fact
implemented by combining the support-vector machine method and the
pair-wise method described above.

4 We use the software TinySVM (Kudoh, 2000) developed by Kudoh as the
support-vector machine.

4 Features (information used in classification)

Although we have explained the three machine-learning methods, using
these methods requires defining the features (information used in
classification). In this section, we explain these features.

As mentioned in Section 2, when the result of word segmentation of a
sentence in the Thai language is input, we output the POS for each
word. Therefore, the features are extracted from the input Thai
sentence. Here, we define the following items as features.

• POS information

The candidate POS tags of the current word, the three previous words,
and the three subsequent words⁵ (e.g., “noun”, “verb”, etc. The total
number of features in the Thai corpus is mentioned in Section 5.)

The candidate POSs were determined in advance for each word by using a
word dictionary or the Thai corpus.

• POS and order information

The pairs of candidate POS tags and their occurrence order for the
current word, the three previous words, and the three subsequent words⁶
(e.g., “noun, the first place”, “verb, the second place”, etc. The
total number of such features is 782.)

The occurrence order indicates the frequency order of the POS in the
training data when it is used for the current word.

• word information

The current word, the three previous words, and the three subsequent
words (e.g., “tomorrow”, “go”, etc. The total number of such features
is 15,763.)

5 In general, since the words preceding the current word have already
been analyzed, we could use only the one POS assigned in the current
context rather than all possible POSs. In fact, previous studies used
the POSs resulting from tagging the previous context. This paper,
however, uses the possible POSs in the previous context for the
following two reasons. One is ease of processing, and the other is that
we took into account cases in which the previous context was tagged
incorrectly.

6 In Ma’s previous studies, the probability of a POS for each word was
used. The feature-based machine learning methods used in this paper
(the decision list and maximum entropy methods), however, are difficult
to use with continuous values such as probabilities in the features.
Therefore, we used the occurrence order instead of the occurrence
probability. Since the order is at most the number of POS ambiguities
and thus not so large, the machine learning methods used in this paper
can handle the order. On the other hand, the support vector machine
method can handle continuous values in the features. However, we used
the occurrence order rather than the occurrence probability to enable
comparison with the decision list and maximum entropy methods. In the
future, we should use the occurrence probability in the support vector
machine.

5 Experiments

Table 1: Experimental results

Method                    Precision
Baseline method           83.6%
HMM                       89.1%
Rule-based                93.5%
Elastic NN                94.4%
Hybrid tagger             95.5%
Decision list             83.6%
Maximum entropy           95.3%
Support vector machine    96.1%

(Precisions are as obtained for ambiguous words only.)

This section describes our experiments on POS tagging in the Thai
language by using the machine-learning methods described in Section 3
with the feature sets described in Section 4, for the task described in
Section 2.

The experiments in this paper were performed by using the same Thai
corpus as in our previous papers (Ma et al., 1998; Ma et al., 1999; Ma
et al., 2000). This corpus contains 10,452 sentences randomly divided
into two sets: one with 8,322 sentences, for training; and the other
with 2,130 sentences, for testing. The training and testing sets
contain, respectively, 22,311 and 6,717 ambiguous words (in other
words, the target words for POS tagging).⁷ The ambiguous words are
those that may serve as more than one POS. The other words always serve
as the same POS, and they were assigned to a POS by using a word
dictionary rather than a machine learning method. In total, 47 POSs are
defined for the Thai corpus (Charoenporn et al., 1997).

7 The total numbers of words including non-ambiguous words are 124,331
and 34,544, respectively.

The experimental results are shown in Table 1. The precisions for
“Baseline method”, “HMM”, “Rule-based”, “Elastic NN”, and “Hybrid
tagger” are from previous papers (Ma et al., 1999; Ma et al., 2000). In
the baseline method, a word is judged to represent the POS that most
frequently appears for that word in the training corpus. “HMM” refers
to a method that performs POS tagging at the sentence level by using
the hidden Markov model. “Rule-based” indicates Brill’s method, that
is, the use of error-driven transformation rules. “Elastic NN” is a
method our group proposed previously (Ma et al., 1999), using a
three-layered perceptron in which the length of the input layer is
changeable. “Hybrid tagger” is another method our group proposed
previously (Ma et al., 2000), combining the elastic NN and rule-based
methods. It improves elastic NN by using Brill’s error-driven learning.
The precision of the hybrid tagger was the best among our previous
studies based on the Thai corpus used in this paper. The results in
Table 1 for the other three methods (decision list method, maximum
entropy method, and support vector machine method) were obtained in
this study.

Among these three methods, the precision of the support vector machine
method (96.1%⁸) was the best. This result is consistent with our other
previous studies (Murata et al., 2001a; Murata et al., 2001b). The
precision of the support vector machine method was also higher than
that of the hybrid tagger (95.5%), which had produced the best
precision in the previous studies. Therefore our study has improved the
technology of POS tagging in the Thai language.

8 The precisions shown in this paper were obtained using ambiguous
words only. The precision for all words, including non-ambiguous words,
was 99.2%.
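The baseline method and the evaluation restricted to ambiguous words, both described above, can be sketched in Python as follows. The data layout (sentences as lists of (word, tag) pairs), the function names, and the rule that a word counts as ambiguous when it occurs with more than one tag in the training data are assumptions made for this illustration; in the paper, the candidate POSs come from a word dictionary or the corpus.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Count how often each word occurs with each POS tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts


def most_frequent_tag(counts, word):
    """Baseline prediction: the tag seen most often for this word in training."""
    return counts[word].most_common(1)[0][0]


def precision_on_ambiguous(counts, test_sentences):
    """Fraction of correctly tagged test words among the ambiguous ones only."""
    correct = 0
    total = 0
    for sentence in test_sentences:
        for word, gold in sentence:
            if len(counts.get(word, ())) > 1:  # more than one tag observed
                total += 1
                if most_frequent_tag(counts, word) == gold:
                    correct += 1
    return correct / total if total else 0.0
```

Words that were never seen in training are simply skipped here, which is a simplification of this sketch rather than a statement about the experiments.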






































Next, we compared the various methods. We first examined the three
methods used in this paper. Since they used exactly the same features,
the comparison was strict. The order of these methods was as follows:

Support vector > Maximum entropy > Decision list

The precision of the decision list method was very low and almost the
same as that of the baseline method. This was because we did not use
AND features (combinations of features) as inputs for the system.
Because the decision list method classifies each example by using only
one feature, the experiments were conducted under adverse conditions
for this method. If we used AND features, the precision of the decision
list method would increase,⁹ but when AND features are generated
indiscriminately, the number of features increases explosively. To add
only a small number of features, we would need to thoroughly examine
which combinations of features should be added. In contrast, the
support vector and maximum entropy methods perform estimation by using
all features. Furthermore, the support vector machine method has a
framework for considering AND features automatically by adjusting the
constant $d$ in the kernel function. We can thus say that the support
vector machine method is an effective machine learning method in that
we do not have to examine AND features by hand.
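The remark that the kernel constant $d$ introduces AND features automatically can be checked numerically for $d = 2$. The Python sketch below compares the polynomial kernel $(x \cdot y + 1)^2$ with an ordinary dot product over an explicit feature vector that contains every product $x_i x_j$; for 0/1 inputs, each such product is exactly the AND of two features. The function names and the vector layout are assumptions made for this illustration.

```python
import math


def poly2_kernel(x, y):
    """(x . y + 1)^2 for two equal-length numeric vectors."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** 2


def explicit_map(x):
    """Feature vector whose dot product reproduces poly2_kernel.

    Layout: [1, sqrt(2) * x_i for each i, x_i * x_j for each ordered pair (i, j)].
    For 0/1 inputs, x_i * x_j is 1 only when both features are present.
    """
    features = [1.0]
    features += [math.sqrt(2) * a for a in x]
    features += [a * b for a in x for b in x]
    return features


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

For any two vectors `x` and `y`, `dot(explicit_map(x), explicit_map(y))` agrees with `poly2_kernel(x, y)` up to floating-point rounding, so a classifier using $d = 2$ implicitly weighs all pairwise feature combinations without enumerating them by hand.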

9 A previous paper (Murata et al., 2000) showed that the decision list
method can produce high precision for bunsetsu identification in
Japanese sentences by using AND features. In this study, the precision
of the decision list method was low because we did not use AND
features.

Next, we compared our methods with the previous methods. This has to be
done carefully, because the features used here did not match those used
in the previous studies. We first considered the rule-based and hybrid
tagger methods. These methods use not only POS information but also
word information in the rule templates used in error-driven learning.
We can thus say that these methods use almost the same features as in
this study, and therefore, they can be compared to the methods used
here. We can say that the order of the main machine learning methods
was as follows:¹⁰

Support vector > Hybrid tagger > Maximum entropy > Rule-based

Table 2: Experimental results when word information was eliminated

Method                    Precision
Decision list             78.0%
Maximum entropy           92.3%
Support vector machine    93.9%

(Precisions are as obtained for ambiguous words only.)

Next we examined the HMM and elastic NN methods. These methods do not
use word information directly: they only use the probability of the
occurrence of a POS in each word. To create similar conditions for
comparison with these methods, we carried out experiments in which the
word information features were eliminated, as shown in Table 2. All
methods produced lower precision in this case than when using word
information. When we compared elastic NN (94.4%) and the support vector
machine without word information (93.9%), the former had higher
precision. Elastic NN, however, uses the probability of the occurrence
of a POS in each word, while the support vector machine uses POS and
order information instead. Since the order information provides less
information than the probability of the occurrence of a POS, this is
not a strict comparison. However, from these results we expect that
elastic NN should have performance as high as that of the support
vector machine.¹¹ As for HMM, we can say that it has lower performance
than the support vector machine and maximum entropy methods, because
its precision was much lower than that of both of these methods.

10 Strictly speaking, the hybrid tagger used AND features, while the
maximum entropy method did not, although it can produce better
precision when AND features are used. Thus, the order of “Hybrid
tagger” and “Maximum entropy” could be changed.

11 Although we have compared methods that use different features, we
should conduct experiments in which the features are the same.








Finally, we examined the reasons why we could improve the precision.
The reason that the support vector machine method produced higher
precision than the HMM and elastic NN methods is that it uses word
information as well. (“HMM” and “Elastic NN” did not use word
information, as mentioned above.) In some cases a POS is determined by
a word in the previous or subsequent context, and in many of these
cases the word information is very helpful. Next, we compared the
support vector machine method to the rule-based and hybrid tagger
methods. Since these methods used almost the same information, the
higher precision of the support vector machine method suggests that it
performs better than the other methods. Since the hybrid tagger
includes Brill’s error-driven learning, that is, the rule-based method,
the performance of the hybrid tagger will deteriorate when the
performance of the rule-based method is poor. We can thus say that we
obtained better precision because we used word information and a
support vector machine with good performance. As for future work, we
should conduct experiments by using word information in the elastic NN
method.


6 Conclusions 


In this paper, we examined POS tagging in the Thai language by using
supervised machine learning methods. As supervised data we used the
corpus described in our group’s previous papers (Ma et al., 2000). We
used the decision list method, the maximum entropy method, and the
support vector machine method as machine learning methods. In the
experimental results, the support vector machine method produced the
best precision. Its precision was slightly higher than the precision
obtained in a previous study, which used a hybrid tagger combining a
neural network with Brill’s error-driven learning.

We examined and compared various machine learning methods, including
those in previous studies. We discussed the good performance of the
support vector machine method. We expected that elastic NN, which is
one method from the previous studies, would also have good performance,
but it does not use word information and its precision was lower than
that of the support vector machine method. We can say that our method
in this paper produced better precision because we used word
information and because we used the support vector machine method,
whose performance is good. In future work, we should conduct
experiments by using word information in the elastic NN method.